A Framework for Measuring Differences in Data Characteristics
Authors
Abstract
A data mining algorithm builds a model that captures interesting aspects of the underlying data. We develop a framework for quantifying the difference, called the deviation, between two datasets in terms of the models they induce. In addition to being a quantitative, intuitively interpretable measure of difference, the deviation between two datasets can also be computed very fast. Our framework covers a wide variety of models, including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-squared metric as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models...
Author footnotes: Contact author. Supported by a Microsoft Graduate Fellowship. This research was supported by Grant 2053 from the IBM corporation. Supported by Army Research Office grant DAAG55-98-1-0333.
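As a rough illustration of the idea in the abstract, the sketch below compares two transaction datasets through the frequent itemsets they induce. It is only a minimal sketch under stated assumptions, not the paper's definition: the function names (`frequent_itemsets`, `support`, `deviation`), the brute-force itemset enumeration, the cap on itemset size, and the choice to sum absolute support differences over the union of both models are all hypothetical choices made here for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of at most max_size items
    whose support (fraction of transactions containing them) is at least
    min_support. Brute-force enumeration, suitable for tiny examples only."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def deviation(d1, d2, min_support=0.5, max_size=2):
    """Assumed deviation: sum of absolute support differences, measured
    over every itemset that is frequent in at least one of the datasets."""
    m1 = frequent_itemsets(d1, min_support, max_size)
    m2 = frequent_itemsets(d2, min_support, max_size)
    regions = set(m1) | set(m2)
    return sum(abs(support(d1, s) - support(d2, s)) for s in regions)


# Hypothetical toy data: each inner list is one transaction.
dataset_a = [["a", "b"], ["a", "b"], ["a"], ["b"]]
dataset_b = [["a"], ["a"], ["a", "c"], ["c"]]

print(deviation(dataset_a, dataset_b))
print(deviation(dataset_a, dataset_a))
```

Measuring supports over the union of both models means an itemset that is frequent in only one dataset still contributes to the total. Other difference and aggregation functions (for example a maximum instead of a sum) could be substituted in `deviation` without changing the rest of the sketch.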
Similar resources
Taxonomy of Global Air Transport
Data from the United Nations and the International Civil Aviation Organization Information Systems were used as a base for characterizing, classifying and comparing air transport demand and supply features of 156 countries. Relevant data from 1980 were chosen to reflect five sets of characteristics, namely air transport, socio-economic status, population demography, geographical and environmenta...
Measuring the Effectiveness of Financial Literacy Programs in Ghana
This paper explores the effectiveness of financial literacy programs. It further seeks to establish the relationship between financial literacy and certain demographic characteristics. This study adopted a correlational research design as the framework to examine the relationship between variables without determining cause and effect. Data were randomly collected from 235 petty traders in Kumas...
TEXTUAL AND INTER-TEXTUAL ANALYSES OF IRANIAN EFL UNDERGRADUATES’ TYPES OF ENGLISH READING TOWARDS DEVELOPING A CAREFUL READING FRAMEWORK
This study investigated textual and inter-textual reading of a group of Iranian EFL undergraduates’ careful English reading types. In this research, Khalifa and Weir’s (2009) reading framework was used to propose a more inclusive aspect of a careful reading framework and the reading construct for instructional and assessment goals. The participants of this study were B.A. students of English Tr...
Integrating the Population Perspective into Health System Performance Assessment (IPHA): Study Protocol for a Cross-Sectional Study in Germany Linking Survey and Claims Data of Statutorily and Privately Insured
Background: Health system performance assessment (HSPA) is a major tool for evidence-based governance in health systems, and patient/population orientation is increasingly considered an important aspect. The IPHA study aims (1) to undertake a comprehensive performance assessment of the German health system from a population perspec...
Retaining Customers Using Clustering and Association Rules in Insurance Industry: A Case Study
This study clusters customers and finds the characteristics of different groups in a life insurance company in order to find a way for prediction of customer behavior based on payment. The approach is to use clustering and association rules based on CRISP-DM methodology in data mining. The researcher could classify customers of each policy in three different clusters, using association rules. A...
DEVELOPMENT OF PRODUCTIVITY MEASUREMENT AND ANALYSIS FRAMEWORK FOR MANUFACTURING COMPANIES
The purpose of this research is to present an alternative approach for measuring productivity in manufacturing companies. To achieve the research objective, an in-depth investigation of the existing productivity measurement and analysis practices of a case manufacturing company has been carried out through both qualitative and quantitative approaches. The investigation has shown that the...
Journal:
- J. Comput. Syst. Sci.
Volume: 64, Issue:
Pages: -
Publication date: 2002